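While Transformers and their variant, Conformers, show promising performance in speech recognition, their heavily parameterized nature leads to a large memory cost during training and inference. Some works use cross-layer weight sharing to reduce the number of model parameters. However, the inevitable loss of capacity harms model performance. To address this issue, this paper proposes a parameter-efficient Conformer via sharing sparsely-gated experts. Specifically, we use a sparsely-gated mixture-of-experts (MoE) to extend the capacity of a Conformer block without increasing computation. Then, the parameters of the grouped Conformer blocks are shared so that the number of parameters is reduced. Next, to give the shared blocks the flexibility to adapt representations at different levels, we design the MoE routers and normalization individually for each block. Moreover, we use knowledge distillation to further improve performance. Experimental results show that, compared with the full-parameter model, the proposed model achieves competitive performance with 1/3 of the encoder parameters.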
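The parameter-sharing idea in this abstract can be illustrated with a toy sketch. It is not the authors' implementation: the class `SharedMoEStack`, the plain linear "experts", and the top-1 routing rule are assumptions made for illustration only. One set of expert weights is reused by every layer, while each layer keeps its own router and normalization parameters.

```python
import math
import random


def make_matrix(rows, cols, rng):
    """Small random matrix stored as a list of rows (illustrative init)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(m, v):
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def layer_norm(v, gain, bias, eps=1e-5):
    mean = sum(v) / len(v)
    var = sum((x - mean) ** 2 for x in v) / len(v)
    return [g * (x - mean) / math.sqrt(var + eps) + b
            for x, g, b in zip(v, gain, bias)]


class SharedMoEStack:
    """Hypothetical stack: experts shared by all layers, router and norm per layer."""

    def __init__(self, dim, num_experts, num_layers, seed=0):
        rng = random.Random(seed)
        # Shared across every layer in the group: a single set of expert weights.
        self.experts = [make_matrix(dim, dim, rng) for _ in range(num_experts)]
        # Kept individual per layer: router weights and normalization parameters.
        self.routers = [make_matrix(num_experts, dim, rng) for _ in range(num_layers)]
        self.gains = [[1.0] * dim for _ in range(num_layers)]
        self.biases = [[0.0] * dim for _ in range(num_layers)]

    def forward(self, x):
        chosen = []
        for router, gain, bias in zip(self.routers, self.gains, self.biases):
            h = layer_norm(x, gain, bias)
            probs = softmax(matvec(router, h))
            k = max(range(len(probs)), key=probs.__getitem__)  # top-1 gating
            chosen.append(k)
            # Only the selected expert runs, so compute does not grow with num_experts.
            y = [probs[k] * v for v in matvec(self.experts[k], h)]
            x = [a + b for a, b in zip(x, y)]  # residual connection
        return x, chosen

    def num_parameters(self):
        dim = len(self.experts[0])
        shared = len(self.experts) * dim * dim
        per_layer = len(self.routers) * (len(self.experts) * dim + 2 * dim)
        return shared + per_layer


stack = SharedMoEStack(dim=4, num_experts=3, num_layers=6)
output, chosen_experts = stack.forward([1.0, 2.0, 3.0, 4.0])
print(chosen_experts, stack.num_parameters())
```

In this toy configuration the shared stack holds 168 parameters, whereas giving each of the six layers its own experts would need 6 × (48 + 20) = 408. The per-layer routers still let different depths pick different experts.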
translated by 谷歌翻译
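Secure multi-party computation-based machine learning, referred to as MPL, has become an important technology for utilizing data from multiple parties with privacy preservation. While MPL provides rigorous security guarantees for the computation process, the models trained by MPL are still vulnerable to attacks that depend solely on access to the models. Differential privacy can help defend against such attacks. However, the accuracy loss brought by differential privacy and the huge communication overhead of secure multi-party computation protocols make it highly challenging to balance the three-way trade-off between privacy, efficiency, and accuracy. In this paper, we are motivated to resolve this issue by proposing a solution, referred to as PEA (Private, Efficient, Accurate), which consists of a secure DPSGD protocol and two optimization methods. First, we propose a secure DPSGD protocol to enforce DPSGD in secret sharing-based MPL frameworks. Second, to reduce the accuracy loss caused by differential privacy noise and the huge communication overhead of MPL, we propose two optimization methods for the MPL training process: (1) a data-independent feature extraction method, which aims to simplify the structure of the trained model; (2) a local data-based global model initialization method, which aims to speed up the convergence of model training. We implement PEA in two open-source MPL frameworks: TF-Encrypted and Queqiao. Experimental results on various datasets demonstrate the efficiency and effectiveness of PEA. For example, when $\epsilon = 2$, we can train a differentially private classification model for CIFAR-10 with an accuracy of 88% within 7 minutes under the LAN setting. This result significantly outperforms that of CryptGPU, a SOTA MPL framework, which takes more than 16 hours to train a non-private deep neural network model on CIFAR-10 to the same accuracy.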
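For readers unfamiliar with DPSGD, the update that such a protocol enforces can be sketched in plaintext. This is only the standard clip-and-add-noise step, not the secure protocol from the abstract, which would run these operations on secret shares. The function names and the `noise_multiplier` parameter are illustrative assumptions.

```python
import math
import random


def clip_gradient(grad, clip_norm):
    """Scale a per-example gradient so its L2 norm is at most clip_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dpsgd_step(weights, per_example_grads, lr, clip_norm, noise_multiplier, rng):
    """One plaintext DPSGD update: clip each gradient, sum, add Gaussian noise."""
    dim = len(weights)
    clipped = [clip_gradient(g, clip_norm) for g in per_example_grads]
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # The noise standard deviation is tied to the clipping bound (the sensitivity).
    sigma = noise_multiplier * clip_norm
    noisy = [s + rng.gauss(0.0, sigma) for s in summed]
    batch_size = len(per_example_grads)
    return [w - lr * n / batch_size for w, n in zip(weights, noisy)]


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, 0.4]]
new_weights = dpsgd_step([0.0, 0.0], grads, lr=0.1, clip_norm=1.0,
                         noise_multiplier=1.1, rng=rng)
print(new_weights)
```

Clipping bounds how much any single example can move the model, which is what lets the added Gaussian noise translate into a privacy guarantee.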
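Sample assignment plays a prominent part in modern object detection approaches. However, most existing methods rely on manual design to assign positive/negative samples and do not explicitly establish the relationship between sample assignment and object detection performance. In this work, we propose a novel dynamic sample assignment scheme based on hyper-parameter search. We first define the number of positive samples assigned to each ground truth as a hyper-parameter and employ a surrogate optimization algorithm to derive the optimal choices. Then, we design a dynamic sample assignment procedure that selects the optimal number of positives at each training iteration. Experiments demonstrate that the resulting HPS-Det brings improved performance over different object detection baselines. Moreover, we analyze hyper-parameter reusability when transferring between different datasets and between different backbones for object detection, which demonstrates the superiority and versatility of our method.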
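To make the searched hyper-parameter concrete, the sketch below assigns a fixed number `k` of positive samples to each ground-truth box. It is an assumption-laden illustration: ranking candidates by IoU and the function name `assign_positives` are not taken from the abstract, and the hyper-parameter search that would choose `k` is not shown.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def assign_positives(candidates, gt_boxes, k):
    """Return {gt_index: [candidate indices]} with k positives per ground truth."""
    assignment = {}
    for gi, gt in enumerate(gt_boxes):
        # Rank all candidates by overlap with this ground truth (assumed criterion).
        ranked = sorted(range(len(candidates)),
                        key=lambda ci: iou(candidates[ci], gt),
                        reverse=True)
        assignment[gi] = ranked[:k]
    return assignment


candidates = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30), (0, 0, 5, 5)]
ground_truths = [(0, 0, 10, 10)]
print(assign_positives(candidates, ground_truths, k=2))
```

A search procedure would treat `k` as the quantity to tune against detection accuracy, and a dynamic variant would let it change from one training iteration to the next.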
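In few-shot domain adaptation (FDA), classifiers for the target domain are trained with accessible labeled data in the source domain (SD) and only a few labeled data in the target domain (TD). However, data in the current era usually contain private information, e.g., data distributed on personal phones. Thus, private information will be leaked if we directly access data in the SD to train a target-domain classifier, as FDA methods require. In this paper, to thoroughly prevent privacy leakage in the SD, we consider a very challenging problem setting in which the classifier for the TD has to be trained using only a few labeled target data and a well-trained SD classifier. We name this setting few-shot hypothesis adaptation (FHA). In FHA, we cannot access data in the SD, so the private information in the SD is well protected. To this end, we propose a target-oriented hypothesis adaptation network (TOHAN) to solve the FHA problem, in which we generate highly compatible unlabeled data (i.e., an intermediate domain) to help train a target-domain classifier. TOHAN maintains two deep networks simultaneously: one focuses on learning the intermediate domain, and the other takes care of the intermediate-to-target distributional adaptation and the target-risk minimization. Experimental results show that TOHAN outperforms competitive baselines.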
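In novel class discovery (NCD), we are given labeled data from seen classes and unlabeled data from unseen classes, and we train clustering models for the unseen classes. However, the implicit assumptions behind NCD are still unclear. In this paper, we demystify the assumptions behind NCD and find that high-level semantic features should be shared among the seen and unseen classes. Based on this finding, NCD is theoretically solvable under certain assumptions and can be naturally linked to meta-learning, which has exactly the same assumption as NCD. Thus, we can empirically solve the NCD problem with meta-learning algorithms after slight modifications. As demonstrated in experiments, this meta-learning-based methodology significantly reduces the amount of unlabeled data needed for training and makes it more practical. The use of very limited data is also justified by the application scenario of NCD: since it is unnatural to label only seen-class data, NCD is sampling rather than labeling in causality. Therefore, unseen-class data should be collected in the course of collecting seen-class data, which is why they are novel and first need to be clustered.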
Masked image modeling (MIM) performs strongly in pre-training large vision Transformers (ViTs). However, small models that are critical for real-world applications benefit only marginally, or not at all, from this pre-training approach. In this paper, we explore distillation techniques to transfer the success of large MIM-based pre-trained models to smaller ones. We systematically study different options in the distillation framework, including distillation targets, losses, input, network regularization, sequential distillation, etc., revealing that: 1) distilling token relations is more effective than CLS token-based and feature-based distillation; 2) using an intermediate layer of the teacher network as the target performs better than using the last layer when the depth of the student does not match that of the teacher; 3) weak regularization is preferred; etc. With these findings, we achieve significant fine-tuning accuracy improvements over from-scratch MIM pre-training on ImageNet-1K classification across the ViT-Tiny, ViT-Small, and ViT-Base models, with gains of +4.2%/+2.4%/+1.4%, respectively. Our TinyMIM model of base size achieves 52.2 mIoU on ADE20K semantic segmentation, which is +4.1 higher than the MAE baseline. Our TinyMIM model of tiny size achieves 79.6% top-1 accuracy on ImageNet-1K image classification, which sets a new record for small vision models of the same size and computation budget. This strong performance suggests an alternative way of developing small vision Transformer models, namely by exploring better training methods rather than introducing inductive biases into architectures as in most previous works. Code is available at https://github.com/OliverRensu/TinyMIM.
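The first finding, distilling token relations rather than features, can be illustrated with a small sketch. It assumes a relation is the softmax-normalized scaled dot product between per-token query and key vectors, and that the student is trained to match the teacher's relations under a KL divergence. The function names are hypothetical, and the abstract does not specify the exact relation or loss used.

```python
import math


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def token_relations(queries, keys):
    """N x N relation matrix: row i is a distribution over tokens for token i."""
    d = len(queries[0])
    return [softmax([sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys])
            for q in queries]


def kl_divergence(p, q, eps=1e-12):
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def relation_distillation_loss(teacher_q, teacher_k, student_q, student_k):
    """Mean KL(teacher relation || student relation) over tokens."""
    t_rel = token_relations(teacher_q, teacher_k)
    s_rel = token_relations(student_q, student_k)
    return sum(kl_divergence(t, s) for t, s in zip(t_rel, s_rel)) / len(t_rel)


# Teacher tokens are 4-dimensional, student tokens 2-dimensional.
teacher_q = [[1.0, 0.0, 0.5, 0.0], [0.0, 1.0, 0.0, 0.5]]
teacher_k = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
student_q = [[0.2, 0.1], [0.1, 0.3]]
student_k = [[1.0, 0.0], [0.0, 1.0]]
print(relation_distillation_loss(teacher_q, teacher_k, student_q, student_k))
```

Because the relation matrix is N × N regardless of channel width, teacher and student can have different hidden sizes without an extra projection layer, which is one practical appeal of relation targets.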
Few-Shot Instance Segmentation (FSIS) requires models to detect and segment novel classes given only a few support examples. In this work, we explore a simple yet unified solution for FSIS as well as its incremental variants, and introduce a new framework named Reference Twice (RefT) to fully explore the relationship between support and query features based on a Transformer-like framework. Our key insights are twofold. First, with the aid of support masks, we can generate dynamic class centers more appropriately to re-weight query features. Second, we find that support object queries have already encoded key factors after base training. In this way, the query features can be enhanced twice, at the feature level and at the instance level. In particular, we first design a mask-based dynamic weighting module to enhance support features and then propose linking object queries for better calibration via cross-attention. After these steps, performance on novel classes improves significantly over our strong baseline. Additionally, our framework can be easily extended to incremental FSIS with minor modification. When benchmarked on the COCO dataset under the FSIS, gFSIS, and iFSIS settings, our method achieves competitive performance compared to existing approaches across different shots, e.g., we boost nAP by +8.2/+9.4 over the current state-of-the-art FSIS method for 10/30-shot. We further demonstrate the superiority of our approach on Few-Shot Object Detection. Code and models will be made available.
We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient because it uses discrete tokens and requires fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient because it uses parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, which translates to high-fidelity image generation and the understanding of visual concepts such as objects, their spatial relationships, pose, cardinality, etc. Our 900M-parameter model achieves a new SOTA on CC3M, with an FID score of 6.06. The Muse 3B-parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
In this paper, a semantic communication framework for image transmission is developed. In the investigated framework, a set of servers cooperatively transmit images to a set of users using semantic communication techniques. To evaluate the performance of the studied semantic communication system, a multimodal metric (ISS) is proposed to measure the correlation between the extracted semantic information and the original image. To meet the ISS requirement of each user, each server must jointly determine the semantic information to be transmitted and the resource blocks (RBs) used for semantic information transmission. We formulate this as an optimization problem that aims to minimize each server's transmission latency while meeting the ISS requirement. To solve this problem, a value decomposition-based, entropy-maximized multi-agent reinforcement learning (RL) algorithm is proposed, which enables servers to coordinate during training and to execute RB allocation in a distributed manner, approaching globally optimal performance with fewer training iterations. Compared to traditional multi-agent RL, the proposed RL algorithm improves the servers' exploration of valuable actions and the probability of finding a globally optimal RB allocation policy based on local observations. Simulation results show that the proposed algorithm can reduce the transmission delay by up to 16.1% compared to traditional multi-agent RL.
Learning the underlying distribution of molecular graphs and generating high-fidelity samples is a fundamental research problem in drug discovery and material science. However, accurately modeling the distribution and rapidly generating novel molecular graphs remain crucial and challenging goals. To accomplish these goals, we propose a novel Conditional Diffusion model based on discrete Graph Structures (CDGS) for molecular graph generation. Specifically, we construct a forward graph diffusion process on both graph structures and inherent features through stochastic differential equations (SDEs) and derive discrete graph structures as the condition for the reverse generative processes. We present a specialized hybrid graph noise prediction model that extracts the global context and the local node-edge dependency from intermediate graph states. We further utilize ordinary differential equation (ODE) solvers for efficient graph sampling, based on the semi-linear structure of the probability flow ODE. Experiments on diverse datasets validate the effectiveness of our framework. In particular, the proposed method still generates high-quality molecular graphs in a limited number of steps.
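As background for the "semi-linear structure of the probability flow ODE" mentioned above, the general form from the score-based diffusion literature is written out below. The notation (drift coefficient $f(t)$, diffusion coefficient $g(t)$, noise scale $\sigma_t$, noise prediction network $\epsilon_\theta$) is an assumption for illustration and may differ from the paper's own symbols for graph structures and features.

```latex
% Forward diffusion SDE with a drift that is linear in the state x
\mathrm{d}\mathbf{x} = f(t)\,\mathbf{x}\,\mathrm{d}t + g(t)\,\mathrm{d}\mathbf{w}

% Probability flow ODE sharing the same marginals p_t(x)
\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t}
  = f(t)\,\mathbf{x} - \tfrac{1}{2}\,g(t)^{2}\,\nabla_{\mathbf{x}} \log p_t(\mathbf{x})

% Substituting the noise-prediction parameterization
% \nabla_x \log p_t(x) \approx -\epsilon_\theta(x, t) / \sigma_t
\frac{\mathrm{d}\mathbf{x}}{\mathrm{d}t}
  = \underbrace{f(t)\,\mathbf{x}}_{\text{linear part}}
  + \underbrace{\frac{g(t)^{2}}{2\,\sigma_t}\,\epsilon_\theta(\mathbf{x}, t)}_{\text{nonlinear part}}
```

The linear part can be integrated in closed form, so a solver only needs to approximate the nonlinear network term. This is the usual reason such ODEs can be solved accurately in few steps.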